ABSTRACT

This thesis develops a model which describes how errors
enter and leave an operating data base for manpower manage-
ment. The model describes the error input process and error
distribution in the data base. The underlying structure for
the model is the M/G/∞ queue. The model is used to determine
the effect of a change in the input error rate on the number
of errors in the data base. An upper limit is determined for
the rate of increase of the mean number of errors, and a
method of determining a minimum time between samples in the
worst possible case is proposed.

I.   INTRODUCTION

Decision making processes in the Department of Defense
are not unlike those in other large governmental and
non-governmental industrial enterprises. During the past fifteen
years, a considerable portion of the logistics, engineering,
and management effort has been computerized. This has
resulted in a considerable number of support and reference ADP
files that constitute the data input for the computer. The
files, or data bases, vary in size from 50,000 up to millions
of records. In terms of alphanumeric characters, some of the
files have from fifty million to ten billion characters. We
will use the operating data base of the Navy's Bureau of
Naval Personnel as an example throughout this thesis, but the
model developed is generally applicable to data bases in the
logistics and engineering areas as well.

The Active Duty Enlisted Master Magnetic Tape Record
(E.M.T.) is the operating data base which this thesis addresses.
It contains 550 systematically arranged alphanumeric characters
for every enlisted man on active duty, approximately 600,000
men. These alphanumeric characters represent such information
as name, rate, serial number, social security number, age,
race, religion, number of dependents, GCT/ARI scores, home of
record, years of formal education, pay entry base date, duty
station, and many others. For a detailed description of the
contents of the data base, see Ref. [4]. This information is
used by manpower managers in the Bureau of Naval Personnel
to facilitate assignments, to fill school quotas, to determine
force parameters, to make budget and end strength predictions,
and to make many other manpower management decisions.

Inputs are made daily to the Enlisted Master Tape by
every reporting unit in the Navy. This is done in the form
of a unit diary. The diary is the paper that is submitted
daily to an ADP center for editing, coding, and eventual
insertion into the E.M.T. For example, see Fig. 1. Information
flows from the reporting units to the ADP centers to the
change routine which alters the E.M.T. See Ref. [2].

The purpose of this thesis is to develop a model which
describes how errors enter and leave the data base (E.M.T.)
and to investigate how this model can be used to help design
sampling methods similar to those in standard statistical
quality control procedures. It does not address format
editing, which is covered in Ref. [2]. Format editing takes place
at the ADP center and during the change routine. If the format
is not correct for the type of data element being changed, the
computer will not perform the change and so indicates. We
are concerned in this thesis with an after-the-fact evaluation
of the data in the E.M.T., that is, with technical editing.
Some examples of technical editing are checks for correct
service number, correct pay entry base date, correct rate,
correct time in grade, and correct duty station. The results
of the study will show that errors arrive in a Poisson fashion,
that the number of errors in the data base has a Poisson
distribution, and how this distribution changes with a change
in the error input rate. A sampling method is described and
an upper limit for the rate of increase of errors in the
system is determined.

Figure 1. Information Flow from Reporting Units to the
Enlisted Master Tape.

This thesis is written in five sections, of which this is
the first. In Section II we model the input process by which
errors enter the data base. We show that they enter in a
Poisson process. In Section III we develop a model for the
distribution of errors in the data base. Our assumptions lead
to formulating the M/G/∞ queue (the infinite server Poisson
queue). In Section IV we use this model to determine the
effect of a change in the input error rate on the number of
errors in the data base. An upper limit is determined for the
rate of increase of the mean number of errors in the data base.
In Section V we use the results of Section IV to help design
error rate sampling procedures. The relationship between the
size of the sample and the frequency of the sampling procedure
is described. The goal is to design a sampling procedure in
the data base which will allow an early detection of
significant changes in the input error rate.


II.   ARRIVAL  OF  ERRORS  INTO  THE  DATA  BASE 

Assume a large number of possible places from which data
can come each day, e.g., 600,000 men, each with 50 data
elements, giving $30 \times 10^{6}$ possible arrivals each day.
Only about 1000 changes occur per day, so the probability any
data element changes is about
$1000/(30 \times 10^{6}) \approx .3 \times 10^{-4}$. A small
proportion of these are in error, so the probability an error
arrives is even smaller.

For each change to the data base which arrives, we define
a Bernoulli random variable $X_i$. When change i is made, $X_i$
takes on a value of zero if change i is correct and of one if
change i is in error. When k changes are made, the number of
errors which occur is a Binomial random variable if the
following two conditions are satisfied: First, the probability of
an error in any change is independent of other changes. Second,
this probability is constant, i.e., the $X_i$'s are independent,
identically distributed random variables.

It seems reasonable to assume that the receipt of an error
from one reporting unit is independent of the receipt of an
error from any other reporting unit. An exception might be a
case in which an incorrect directive is being followed by all
reporting units. This is the type of case which we would like
to discover has occurred; however, for normal operations, we
can assume it is not the case. We have no data to test whether
or not the $X_i$'s are identically distributed. This assumption
implies, for example, that the probability an arriving data
element (such as pay entry base date) is in error is the same
no matter where it came from or when it arrives.

The Poisson approximation to the Binomial is justified in
cases where n is large and p is small. This is precisely the
case under consideration. The number of possible changes, n,
in the data base is very large ($\approx 30 \times 10^{6}$), and
p, the probability of receipt of an error, is very small. Thus,
in the remainder of the thesis, we assume that errors arrive in
the data base in a Poisson process.
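
As a rough numerical check of this approximation, the sketch below
compares Binomial and Poisson probabilities for the 30 x 10^6 daily
opportunities quoted above. The figure of 100 errors per day (10% of
the roughly 1000 daily changes) is an assumed value chosen only for
illustration, and the function names are hypothetical.

```python
import math

def binomial_pmf(k, n, p):
    """P{exactly k errors in n independent trials}, computed in log space."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log1p(-p))
    return math.exp(log_pmf)

def poisson_pmf(k, mean):
    """P{exactly k errors} for a Poisson count with the given mean."""
    return math.exp(-mean + k * math.log(mean) - math.lgamma(k + 1))

N_OPPORTUNITIES = 600_000 * 50      # 30 x 10^6 data elements (from the text)
ASSUMED_ERRORS_PER_DAY = 100.0      # assumption: 10% of ~1000 daily changes
P_ERROR = ASSUMED_ERRORS_PER_DAY / N_OPPORTUNITIES

# Largest pointwise gap between the two distributions over k = 0..200.
max_gap = max(
    abs(binomial_pmf(k, N_OPPORTUNITIES, P_ERROR)
        - poisson_pmf(k, ASSUMED_ERRORS_PER_DAY))
    for k in range(0, 201)
)
print(f"largest pmf difference for k = 0..200: {max_gap:.2e}")
```

With n this large and p this small, the two sets of probabilities
should agree to several decimal places, which is the sense in which
the Poisson assumption is used in the remainder of the thesis.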

III.   MODEL  FOR  THE  DISTRIBUTION  OF  ERRORS

Assume that each day a number of errors arrive, this
number being a Poisson random variable. Each one enters the data
base and remains there until, at some future date, this data
point is changed. It is changed by either of the following
events:

a.  a correct updated version of the data point
    arrives and replaces what is already there, or

b.  an incorrect updated version of the data point
    arrives and replaces what is already there.

From our Poisson assumptions, we know that the events a.
and b. are independent. Since the probability of replacing a
data point which is currently in error with a new data point
which is also in error is very small, we can assume that almost
all the time, events of type a. are the only ones which remove
errors from the data base.

Thus, the time an error spends in the data base is
essentially independent of the arrival stream of errors.

Therefore the data base acts like the M/G/∞ queue (the
infinite server queue, with Poisson arrivals and a general
service time distribution).

Assume that the errors arrive at the data base at rate
$\lambda$, that each one stays in the data base a random time,
and that service time (length of stay in the data base) is
randomly distributed as G.

Now, using the model of the infinite server Poisson queue,
the problem of solving for the distribution of the number of
errors in the data base is addressed. For the M/G/∞ queue,
with arrivals at rate $\lambda$ and mean service time (average
time a data point spends in the data base) of $1/\mu$, the number
of errors in the system in steady state has a Poisson
distribution with mean $\lambda/\mu$ [Ross Ref. 1, p. 18].

In general, let X(t) denote the number of errors in the
system at time t, where we start with no errors at time 0. We
determine the distribution of X(t) by conditioning on N(t), the
total number of errors which have arrived by time t. By
conditioning, we obtain

$$
P\{X(t)=j\} = \sum_{n=0}^{\infty} P\{X(t)=j \mid N(t)=n\}\,
e^{-\lambda t}\,\frac{(\lambda t)^{n}}{n!} . \tag{1}
$$

The probability that an error which arrives at time x will
still be present at time t is 1-G(t-x). Hence, given that
N(t)=n, it follows that the probability an arbitrary one of
these errors is still present at time t is given by

$$
p = \int_{0}^{t} \bigl(1-G(t-x)\bigr)\,\frac{dx}{t}
  = \int_{0}^{t} \bigl(1-G(x)\bigr)\,\frac{dx}{t} , \tag{2}
$$

independently of the others. This follows since we know that
given N(t)=n, the n arrival times $S_1, \ldots, S_n$ have the same
distribution as the order statistics corresponding to n
independent random variables uniformly distributed on the interval
(0,t) [Ross Ref. 1, p. 17].

Hence,

$$
P\{X(t)=j \mid N(t)=n\} =
\begin{cases}
\binom{n}{j}\, p^{j} (1-p)^{n-j}, & j = 0, 1, \ldots, n \\
0, & j > n .
\end{cases}
$$

Thus, by (1) we have

$$
\begin{aligned}
P\{X(t)=j\}
 &= \sum_{n=j}^{\infty} \binom{n}{j}\, p^{j} (1-p)^{n-j}\,
    e^{-\lambda t}\,\frac{(\lambda t)^{n}}{n!} \\
 &= e^{-\lambda t}\,\frac{(\lambda t p)^{j}}{j!}
    \sum_{n=j}^{\infty} \frac{\bigl(\lambda t (1-p)\bigr)^{n-j}}{(n-j)!} \\
 &= e^{-\lambda t p}\,\frac{(\lambda t p)^{j}}{j!} ,
\end{aligned}
$$

where

$$
p = \int_{0}^{t} \bigl(1-G(x)\bigr)\,\frac{dx}{t} .
$$

That is, X(t) has a Poisson distribution with mean

$$
\lambda \int_{0}^{t} \bigl(1-G(x)\bigr)\,dx .
$$
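
The result can be checked by simulation. The sketch below is only an
illustration under stated assumptions: it takes G to be exponential
(the model itself allows any G), uses made-up values for the arrival
rate, removal rate and time horizon, and uses hypothetical function
names. For an exponential G with rate $\mu$, the mean above reduces to
$\lambda (1 - e^{-\mu t})/\mu$, and a Poisson count should have a
variance equal to that mean.

```python
import math
import random

def errors_present_at(t, arrival_rate, draw_stay, rng):
    """Simulate one history on (0, t) and count errors still present at t."""
    present = 0
    arrival = rng.expovariate(arrival_rate)        # first Poisson arrival
    while arrival < t:
        if arrival + draw_stay(rng) > t:           # not yet corrected at time t
            present += 1
        arrival += rng.expovariate(arrival_rate)   # next arrival
    return present

def poisson_mean_exponential_stay(t, arrival_rate, removal_rate):
    """lambda * integral_0^t (1 - G(x)) dx for G(x) = 1 - exp(-removal_rate * x)."""
    return arrival_rate * (1.0 - math.exp(-removal_rate * t)) / removal_rate

# Hypothetical values, chosen only to keep the simulation small.
LAM, MU, T = 5.0, 0.5, 3.0
rng = random.Random(12345)
counts = [errors_present_at(T, LAM, lambda r: r.expovariate(MU), rng)
          for _ in range(20000)]

sample_mean = sum(counts) / len(counts)
sample_var = sum((c - sample_mean) ** 2 for c in counts) / (len(counts) - 1)
theory = poisson_mean_exponential_stay(T, LAM, MU)
print(f"theory {theory:.3f}  sample mean {sample_mean:.3f}  sample variance {sample_var:.3f}")
```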

IV.   CHANGE  IN  THE  INPUT  ERROR  RATE

For the M/G/∞ queue with arrivals at rate $\lambda$ and mean
service time (average time a data point spends in the data
base) of $1/\mu$, the number in the system (errors in the system)
in steady state has a Poisson distribution with mean
$\lambda/\mu$ [Ross Ref. 1, pp. 17, 18, 19].

Assume that for $t < t_1$, the data base errors have been
arriving at a constant rate $\lambda$ for some time and that the
system is in steady state. At $t_1$, the error rate changes to a
new rate $\beta$ which for simplicity we assume is greater than
$\lambda$. Thus in steady state at this new rate, the expected
number of errors in the data base is $\beta/\mu$. We wish to
investigate how fast the expected number of errors reaches this
level. (See Fig. 2.) Let

X(t)    =  the number of errors in the data base at time t;

Y(t,x)  =  the number of errors in the data base at time (t+x)
           that were in the data base at time t, x>0; and

Z(t,x)  =  the number of errors in the data base at time (t+x)
           that arrived during the interval (t,t+x).

Then

$$
X(t+x) = Y(t,x) + Z(t,x). \tag{3}
$$


It is reasonable to assume that Y(t,x) and Z(t,x) are
independent random variables. They are dependent to the
extent that an incoming error might replace an error that is
already there. We assume that such an event occurs only very
rarely.

Figure 2. Expected Number of Errors in the Data Base
vs. Time.

Our problem is to find the distribution of $X(t_1+x)$. To
do this we must find the distributions of $Y(t_1,x)$ and
$Z(t_1,x)$.

For $Y(t_1,x)$, we have

$$
P\{Y(t_1,x)=k \mid X(t_1)=n\}
 = P\{(n-k)\ \text{of the errors in the system at time}\ t_1\
      \text{have left by time}\ (t_1+x)\}.
$$

An error which is in the system at time $t_1$ has left
the system by time $(t_1+x)$ if and only if its remaining service
time is no more than x. If $t_1$ is an arbitrarily chosen point,
then the remaining service time will be distributed the same as
an equilibrium excess random variable [Ross Ref. 1, pp. 44-47].
That is, let service time (S) be distributed as G, with
$E(S)=1/\mu$; then the remaining service time at $t_1$ for an
arbitrary error is distributed as $G_e(y)$, where

$$
G_e(y) = \mu \int_{0}^{y} \bigl(1-G(u)\bigr)\,du . \tag{4}
$$

With this notation, we have

$$
P\{Y(t_1,x)=k \mid X(t_1)=n\}
 = \binom{n}{k}\, \bar{G}_e(x)^{k}\, G_e(x)^{n-k} , \tag{5}
$$

where $\bar{G}_e(x) = 1 - G_e(x)$.

We also know that [see Ross Ref. 1, pp. 18, 19]

$$
P\{X(t_1)=n\} = \left(\frac{\lambda}{\mu}\right)^{n}
 \frac{e^{-\lambda/\mu}}{n!} , \qquad n = 0, 1, 2, \ldots . \tag{6}
$$

By conditioning on $X(t_1)$ we have that

The inequality (10), when used in (9), shows that the
true mean value function is bounded above as follows:

$$
E[X(t_1+x)] \le
\begin{cases}
\lambda/\mu + (\beta-\lambda)\,x, & x < 1/\mu \\
\beta/\mu, & x \ge 1/\mu .
\end{cases}
$$

This is shown in Fig. 3.
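
The intermediate steps, Eqs. (7) through (10), do not appear in this
section. The following is a sketch of one way the bound can be reached
from Eqs. (3) through (6); it is a reconstruction under the stated
assumptions, not a reproduction of the original equations, and the
equation numbers are deliberately left off. It assumes, as above, that
$Y(t_1,x)$ and $Z(t_1,x)$ are independent, and it applies the Section
III result with rate $\beta$ to the errors that arrive after $t_1$.

```latex
\begin{aligned}
% Thinning the Poisson count of Eq. (6) with the binomial of Eq. (5):
E[Y(t_1,x)] &= \frac{\lambda}{\mu}\,\bar{G}_e(x) , \\
% Section III result with rate \beta over an interval of length x, then Eq. (4):
E[Z(t_1,x)] &= \beta \int_{0}^{x} \bigl(1-G(u)\bigr)\,du
             = \frac{\beta}{\mu}\,G_e(x) , \\
% Adding the two means, using Eq. (3):
E[X(t_1+x)] &= \frac{\lambda}{\mu} + \frac{\beta-\lambda}{\mu}\,G_e(x) , \\
% Since 0 \le 1-G(u) \le 1 and G_e is a distribution function:
G_e(x) &\le \min(\mu x,\ 1) .
\end{aligned}
```

Because $\beta > \lambda$, substituting the last inequality into the
mean gives $\lambda/\mu + (\beta-\lambda)x$ for $x < 1/\mu$ and
$\beta/\mu$ otherwise, which is the piecewise bound stated above.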

Figure 3 is useful in that it shows the maximum rate at
which the mean number of errors in the data base adjusts to
its new equilibrium value when the error input rate changes.
This will give us an idea of how frequently to sample the
data base to see if error rates are changing. This is
discussed in Section V.
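
To see how tight the bound is, the sketch below compares it with the
mean for one particular choice of G. The exponential residence time is
an assumption made only for this illustration (for it, the mean after
the change is $\lambda/\mu + ((\beta-\lambda)/\mu)(1 - e^{-\mu x})$, a
standard M/G/∞ result); the rates and function names are hypothetical.

```python
import math

def mean_errors_upper_bound(x, lam, beta, mu):
    """Piecewise-linear upper bound on E[X(t1 + x)] (the curve of Fig. 3)."""
    if x < 1.0 / mu:
        return lam / mu + (beta - lam) * x
    return beta / mu

def mean_errors_exponential_stay(x, lam, beta, mu):
    """E[X(t1 + x)] when residence times are exponential with rate mu (assumed)."""
    return lam / mu + (beta - lam) / mu * (1.0 - math.exp(-mu * x))

# Hypothetical rates: 100 errors/day before the change, 150 after,
# and a mean residence time of 300 days.
LAM, BETA, MU = 100.0, 150.0, 1.0 / 300.0

for days in (0, 30, 150, 300, 900):
    bound = mean_errors_upper_bound(days, LAM, BETA, MU)
    exact = mean_errors_exponential_stay(days, LAM, BETA, MU)
    print(f"x = {days:4d} days   bound = {bound:9.1f}   exponential case = {exact:9.1f}")
```

The bound and the exponential case agree at x = 0 and both approach
$\beta/\mu$ for large x; in between, the bound rises faster, which is
why it serves as the worst case when choosing a sampling interval.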

Figure 3. Upper Bound on the Mean of a Changing Input Error
Rate. (Vertical axis: E[No. errors in data base]; horizontal
axis: Time.)

V.   SAMPLING  PROCEDURE  AND  STATISTICAL  QUALITY  CONTROL

Quality  control  of,  and  sampling  from,  a  population  which 
has  numerous  input  streams  all  subject  to  human  error  can  be 
approached  using  standard  quality  control  procedures  if  the 
following  assumption  is  made:   many  separate  reporting  units 
following  the  same  set  of  instructions  with  regard  to  diary 
and  service  record  entries  act  as  a  single  system. 

This assumption is the basis for the quality control
procedures now being used in a naval manpower data base, the
U.S. Marine Corps Manpower Management System (see Ref. [2]).
The method used is as follows: In order to determine the
fraction of errors in the data base, a sample of 2500 (out of
about 200,000) service records is randomly selected and compared
with the source documents. The sampled data is sent, via U.S.
mail, to the individual's reporting unit, under a cover letter
asking for match/mismatch information between the information
in the data base and the correct records at his reporting unit.
Verification of the data is limited to the following: match,
mismatch, or "can't find." The first indicates no error. The
second indicates an error. The third indicates a case which
arises when an individual is in transit between duty stations
or not at his last known command. This last case could occur
for many reasons, leave and temporary duty assignments
elsewhere being the most likely. These cases are removed from the
sample, and the fractional error rate for a given type of data
element is found by dividing the total mismatches by the sum
of the mismatches and matches.

When the match/mismatch information is returned, the mean
and variance of the fraction of errors for each element are
determined as follows. Any process which generates output
that can be characterized as either "correct" or "incorrect,"
for which each generating event (trial) is independent in
the sense that it is not influenced by prior events and does
not influence subsequent events, and which can be described by
a single parameter giving the probability of correct (or
incorrect) events, is called a Bernoulli process. The probability
of exactly c correct (or incorrect) events in n trials of such
a process with parameter p is given by the binomial
distribution. For purposes of this study, the finite population of
elements in the data base is so large that we have assumed the
population to be infinite. This assumption is common in cases
of acceptance sampling [Fetter Ref. 3, Chapter 1].

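Let the sample size be n, and let the number of errors be
a random variable X, where X is distributed Binomial (n,p).
Then

$$
\hat{p} = \frac{X}{n} , \quad\text{so that}\quad
E(\hat{p}) = E\!\left(\frac{X}{n}\right) = \frac{1}{n}\,E(X)
           = \frac{1}{n}\,(np) = p .
$$

Also,

$$
\mathrm{Var}(\hat{p}) = \mathrm{Var}\!\left(\frac{X}{n}\right)
 = \frac{1}{n^{2}}\,\mathrm{Var}(X) = \frac{1}{n^{2}}\,npq
 = \frac{pq}{n} , \qquad q = 1-p .
$$

Then

$$
\sigma_{\hat{p}} = \sqrt{\frac{pq}{n}} .
$$

Now, we know that E(X) = np and Var(X) = npq, thus

$$
\sigma_x = \sqrt{npq} \quad\therefore\quad
\frac{\sigma_x}{n} = \sqrt{\frac{pq}{n}} = \sigma_{\hat{p}} . \tag{11}
$$

To estimate the variance of X from our single sample, define
$\sigma_x^{2} = \mathrm{Var}(X)$. Then let

$$
\hat{\sigma}_x^{2} = n\hat{p}\hat{q}
 = n\left(\frac{X}{n}\right)\left(1-\frac{X}{n}\right)
 = X - \frac{X^{2}}{n} . \tag{12}
$$

Now we determine if this estimate is biased, and we find that

$$
\begin{aligned}
E[\hat{\sigma}_x^{2}]
 &= E\!\left[X - \frac{X^{2}}{n}\right]
  = E[X] - E\!\left[\frac{X^{2}}{n}\right] \\
 &= np - \frac{1}{n}\,E[X^{2}]
  = np - \frac{1}{n}\,\bigl[np(1-p) + n^{2}p^{2}\bigr] \\
 &= (n-1)\,p(1-p),
\end{aligned}
$$

since

$$
E[X^{2}] = \mathrm{Var}[X] + \{E[X]\}^{2} = npq + n^{2}p^{2} .
$$

Thus we see that the above definition of $\hat{\sigma}_x^{2}$ is a
biased estimate of $\sigma_x^{2}$. We can eliminate the bias by
introducing the factor $\frac{n}{n-1}$ into Eq. (12). Thus, we
redefine

$$
\hat{\sigma}_x^{2} = \frac{n}{n-1}\,X\left(1-\frac{X}{n}\right) . \tag{13}
$$

It can be seen from the following calculations that this
is an unbiased estimate of the variance.

$$
\begin{aligned}
E[\hat{\sigma}_x^{2}]
 &= \frac{n}{n-1}\left\{E[X] - \frac{1}{n}\,E[X^{2}]\right\} \\
 &= \frac{n}{n-1}\left[np - \frac{1}{n}\bigl(np(1-p) + n^{2}p^{2}\bigr)\right] \\
 &= \frac{n}{n-1}\,\bigl[(n-1)\,p(1-p)\bigr] = npq .
\end{aligned}
$$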
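
The two expectations can also be confirmed numerically. The sketch
below sums over the Binomial distribution exactly rather than
sampling; the values n = 50 and p = 0.1 and the function names are
chosen only for illustration.

```python
import math

def binomial_pmf(k, n, p):
    """P{X = k} for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1.0 - p)**(n - k)

def expected_value(estimator, n, p):
    """Exact E[estimator(X)] for X ~ Binomial(n, p)."""
    return sum(estimator(k) * binomial_pmf(k, n, p) for k in range(n + 1))

n, p = 50, 0.1   # hypothetical sample size and error fraction

# Eq. (12): X(1 - X/n), expected value (n-1)p(1-p)
biased = expected_value(lambda x: x * (1 - x / n), n, p)
# Eq. (13): the same estimate multiplied by n/(n-1), expected value npq
unbiased = expected_value(lambda x: n / (n - 1) * x * (1 - x / n), n, p)

print(f"E[Eq. 12 estimate] = {biased:.6f}   (n-1)p(1-p) = {(n - 1) * p * (1 - p):.6f}")
print(f"E[Eq. 13 estimate] = {unbiased:.6f}   npq         = {n * p * (1 - p):.6f}")
```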
With this estimate of the variance, the width of the 95%
confidence interval can be determined as follows. For the
sample of 2500 let X = 250. Then

$$
\hat{p} = \frac{250}{2500} = .1 .
$$

Then by (13),

$$
\hat{\sigma}_x^{2} = \frac{n}{n-1}\,X\left(1-\frac{X}{n}\right) \approx 225 .
$$

Thus

$$
\hat{\sigma}_x \approx 15 \quad\text{and}\quad
\hat{\sigma}_{\hat{p}} = \frac{15}{2500} = .006 .
$$

With this estimate of the standard deviation of the fraction
of error (Fig. 4), we see that the 95% confidence interval is
of width $2 \times (1.96) \times (.006) \approx .024$, since for
large n, binomial probabilities may be approximated by the normal
distribution. The values of the parameters determined from this
sample are taken to be the population parameters. When considering
samples of size 200,000 (the whole population) and of size 200,
with reference to the sample of 2500, plotted in Fig. 4, it can be
seen that the width of the confidence interval decreases as the
sample size increases; by Eq. (11) it is inversely proportional
to the square root of the sample size. For a sample which is equal
to the whole population, the fraction of errors discovered
would not be an estimate, but would in fact be the true
fraction of errors. The width of the confidence interval would be
zero, as shown by the line at .1 in Fig. 4.

For a sample of size 200, by Eq. (13) we would have:

$$
\hat{\sigma}_x^{2} = \frac{n}{n-1}\,X\left(1-\frac{X}{n}\right)
 = \frac{200}{199}\,(20)(.9) \approx 18 ,
$$

thus

$$
\hat{\sigma}_x = 4.25 \quad\text{and}\quad
\hat{\sigma}_{\hat{p}} = \frac{4.25}{200} \approx .0213 .
$$

Figure 4. Comparison of Confidence Interval Width for Three
Different Sample Sizes.

Then the 95% confidence interval is of width
$2 \times (1.96) \times (.0213) \approx .083$.
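
The same calculation can be written as a small function. This is a
sketch that follows Eqs. (13) and (11) directly; the function name is
hypothetical, and because it carries no intermediate rounding its
results differ slightly in the third decimal place from the rounded
hand calculations.

```python
import math

def confidence_interval_width(n, mismatches, z=1.96):
    """Full width of the normal-approximation interval for the error fraction."""
    var_x = n / (n - 1) * mismatches * (1 - mismatches / n)   # Eq. (13)
    sigma_p = math.sqrt(var_x) / n                            # Eq. (11)
    return 2 * z * sigma_p

# The two samples worked in the text: 250 errors in 2500, 20 errors in 200.
for n, x in ((2500, 250), (200, 20)):
    print(f"n = {n:5d}, X = {x:4d}: width = {confidence_interval_width(n, x):.4f}")
```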

Sample size can be determined using the above procedure if
we know the confidence level and interval width we desire and
the population parameter (p). For example, if p=.1, and the
half width of the interval, say a, is .02, and if we desire a
95% level of confidence in this interval, then we know that

$$
1.96\,\sigma_{\hat{p}} = .02 , \qquad
\sigma_{\hat{p}} = \frac{.02}{1.96} \approx .01 ;
$$

then by (11) we have that

$$
.01 = \sigma_{\hat{p}} = \frac{\sigma_x}{n}
\quad\text{or}\quad
n = \frac{\sigma_x}{.01} = \frac{\sqrt{npq}}{.01} ;
$$

thus

$$
n^{2} = \frac{npq}{.0001} ;
$$

thus

$$
n = \frac{pq}{.0001} = \frac{(.1)(.9)}{.0001} = 900 .
$$

See Figs. 5 and 6 for a plot of sample sizes vs. width of the
confidence interval for p=.1 and .05 respectively.
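
Solving Eq. (11) for n gives $n = pq/\sigma_{\hat{p}}^{2}$ with
$\sigma_{\hat{p}} = a/1.96$. The sketch below applies this without
rounding $\sigma_{\hat{p}}$ to .01, so for p = .1 and a = .02 it
returns a somewhat smaller sample than the 900 obtained above with
the rounded value. The function name is hypothetical.

```python
import math

def required_sample_size(p, half_width, z=1.96):
    """Smallest n whose interval half width does not exceed half_width."""
    sigma_p = half_width / z                      # from z * sigma_p = half_width
    return math.ceil(p * (1 - p) / sigma_p ** 2)  # n = pq / sigma_p^2, Eq. (11)

print(required_sample_size(0.10, 0.02))   # p = .1  -> 865
print(required_sample_size(0.05, 0.02))   # p = .05 -> 457
```

Rounding $\sigma_{\hat{p}}$ up to .01 before squaring, as in the hand
calculation, errs on the side of a larger sample.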

Now, having solved for the mean and variance of the fraction
of errors in a sample and having determined the sample size
which must be used in order to have a desired confidence
interval, we are able to address the question of frequency
of sampling in terms of the minimum time between samples.

In order to be able to control the error rate, we must
be able to predict its behavior. By the methods just described,
we are able to determine the limits within which the error
rate from a sample should lie. By assuming that the same set
of causal factors will continue to operate in the future, it
is usually possible to make a prediction of the expected
behavior of the system. Then, if a change occurs in the causal
system which changes the error rate, this fact should be
quickly apparent through an increase in the error rate in the
samples. We then attempt to discover the cause of the increase
and eliminate it. The time between samples becomes an important
parameter when early detection of changes in error rate is
desired. See the quality control chart in Fig. 7 [Fetter Ref.
3, Chapters 1 through 3].

Figure 6. Plot of Sample Size vs. Confidence Interval Width
for p=.05.

As long as the fraction of errors stays within the upper
and lower confidence bounds, we say that it is "in control."
When it leaves this interval on the upper side, we say it is
"out of control." If we were to pick limits of $\pm 1.96\sigma$
for our confidence interval, we would have a 95% confidence
interval. That is, only five times out of a hundred would a
sample from a process that was actually "in control" fall
outside the limits and lead us to think it was "out of control."
If we were to increase the limits of our confidence interval to
$\pm 3\sigma$, we would then have a 99+% confidence interval;
that is, only three times out of a thousand would we mistakenly
infer that the system was "out of control." The wider the
limits, the greater confidence we have in our interval. On the
other hand, wider limits decrease the probability that a change
in the process will be detected quickly.

Figure 7. Typical Quality Control Chart.

Figure 8 shows changes in the average error from $X_1$ to
$X_2$, with their corresponding normal densities. The shaded area
represents the only place on the graph of the new density
where an error rate will be determined to have changed, since
any other place in the new density is still within the limits
of the old density. The shaded area then is the probability
that the shift will be detected by any sample taken following
the change.
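
The shaded area can be computed under the normal approximation. The
sketch below is an illustration only: it treats the sample error
fraction as normal with mean equal to the new fraction, counts a
detection when the sample falls above the old upper limit (the
"upper side" case discussed above), and uses hypothetical function
names and example fractions.

```python
import math

def normal_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def detection_probability(p_old, p_new, n, z=1.96):
    """P{sample fraction exceeds the old upper control limit | true fraction p_new}."""
    upper_old = p_old + z * math.sqrt(p_old * (1 - p_old) / n)
    sigma_new = math.sqrt(p_new * (1 - p_new) / n)
    return 1.0 - normal_cdf((upper_old - p_new) / sigma_new)

# Hypothetical shifts from an error fraction of .10, samples of 1000.
for p_new in (0.10, 0.12, 0.14):
    print(f"new fraction {p_new:.2f}: detection probability "
          f"{detection_probability(0.10, p_new, 1000):.3f}")
```

With no shift at all the function returns the upper-tail false alarm
rate of the limits, and the probability rises toward one as the new
fraction moves away from the old one.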

Recall  Fig.  3  in  Section  IV  for  a  change  in  input  error 
rate.   Now  consider  the  worst  possible  case  of  input  errors, 
when  every  change  which  arrives  at  the  data  base  is  an  error. 
Then  if  at  time  t  and  thereafter  every  entry  to  the  data  base 
is  an  error,  the  number  of  errors  in  the  data  base  will  increase 
at  its  maximum  rate.   The  upper  bound  on  the  fraction  of  errors 
present  at  time  t,  say  p(t),  is  shown  in  Fig.  9  as  the  solid 
diagonal  line. 

We draw confidence limits about p(t) just as we do about
the steady state lines. From Fig. 9 it can be seen that unless
the lower confidence bound on p(t) exceeds the upper confidence
bound on p(0), we will probably not detect a change in the
error rate.

Figure 8. Change in Average Error and Normal Densities.

The equation of p(t) is known since the average number of
changes per day is known, and since all of these changes are
assumed to be in error. We number the time scale so that the
diagonal starts at t=0.

$$
p(t) = p(0) + \mu\,\bigl(1-p(0)\bigr)\,t , \qquad 0 \le t \le 1/\mu , \tag{14}
$$

where $1/\mu$ is the time an error spends in the data base
(assumed constant to obtain the upper bound in Fig. 9). Let
2a(t) be the width of the 95% confidence interval at t. Then
for a sample of size n,

$$
a(t) = \frac{1.96\,\sigma_x}{n}
     = 1.96\,\sqrt{\frac{p(t)\bigl(1-p(t)\bigr)}{n}} ,
\qquad 0 \le t \le 1/\mu . \tag{15}
$$

Let F(t) be the lower confidence limit at t. Then

$$
F(t) = p(t) - a(t)
     = p(t) - 1.96\,\sqrt{\frac{p(t)\bigl(1-p(t)\bigr)}{n}} . \tag{16}
$$

The upper confidence limit at t=0 is

$$
p(0) + a(0) = p(0) + 1.96\,\sqrt{\frac{p(0)\bigl(1-p(0)\bigr)}{n}} . \tag{17}
$$

We take the minimum time between samples to be that t for
which (16) and (17) are equal. That is, $t^{*}$, the time between
samples, satisfies

$$
F(t^{*}) = p(0) + 1.96\,\sqrt{\frac{p(0)\bigl(1-p(0)\bigr)}{n}} . \tag{18}
$$

For example, consider the case where p(0)=.1, a sample size
n of 1000, a confidence interval of 95%, and a time in the
data base of 300 days. Then

$$
p(t) = .1 , \qquad t < 0 ,
$$

and

$$
p(t) = .1 + .003\,t , \qquad 0 \le t \le 300 .
$$

Also

$$
a(t) = 1.96\,\sqrt{\frac{.09 + .0024\,t - .000009\,t^{2}}{1000}} ,
\qquad 0 \le t \le 300 .
$$

Thus

$$
a(0) = .019 \quad\text{and}\quad p(0) = .1 .
$$

From Eq. (18) we find that

$$
t^{*} \approx 14\ \text{days}.
$$

That is, the minimum sampling period to consider is two weeks.
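
Eq. (18) has no convenient closed form, but it is easily solved
numerically. The sketch below uses bisection on the interval from 0
to the residence time, over which the lower limit F(t) increases for
the values used here; the function names are hypothetical and the
inputs are those of the example above.

```python
import math

def half_width(p, n, z=1.96):
    """a = z * sqrt(p(1-p)/n), as in Eq. (15)."""
    return z * math.sqrt(p * (1.0 - p) / n)

def worst_case_fraction(t, p0, mean_stay):
    """Eq. (14) with mu = 1/mean_stay: every arriving change is an error."""
    return p0 + (1.0 - p0) * t / mean_stay

def minimum_time_between_samples(p0, n, mean_stay, z=1.96, tol=1e-9):
    """Solve Eq. (18) for t by bisection on [0, mean_stay]."""
    upper_old = p0 + half_width(p0, n, z)                 # Eq. (17)

    def gap(t):
        p_t = worst_case_fraction(t, p0, mean_stay)
        return p_t - half_width(p_t, n, z) - upper_old    # F(t) minus Eq. (17)

    lo, hi = 0.0, mean_stay     # gap(lo) < 0 and gap(hi) > 0 for these inputs
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if gap(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

t_star = minimum_time_between_samples(p0=0.1, n=1000, mean_stay=300.0)
print(f"t* = {t_star:.2f} days, i.e. {math.ceil(t_star)} days when rounded up")
```

The root falls between 13 and 14 days, which rounds up to the two
weeks quoted above.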